Title

Home Credit Default Risk

Group number: 02

Team members: (clockwise in the picture)

  1. Kiran Kanrandikar (kikarand@iu.edu)
  2. Yashwitha Reddy (ypondug@iu.edu)
  3. Rahul (rgomathi@iu.edu)
  4. Sathish (satsoun@iu.edu)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts a large number of datasets. In the past, submitting results was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

The API is quite easy to set up; a submission takes less than 15 minutes.

  1. Install the kaggle library

For more detailed information on setting up the Kaggle API, see the official Kaggle API documentation.
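The setup steps can be sketched as the following commands. This is a sketch, not the exact commands used in this project: it assumes you have downloaded an API token (`kaggle.json`) from your Kaggle account page into the current directory.

```shell
# Install the Kaggle CLI
pip install kaggle

# Place the API token where the CLI expects it, readable only by you
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# List and download the competition files
kaggle competitions files home-credit-default-risk
kaggle competitions download -c home-credit-default-risk
```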

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file into DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Mount Gdrive

Imports

Extract Zip Files, ignore if unzipped already

Data files overview

Data Dictionary

As part of the data download comes a data dictionary. It is named HomeCredit_columns_description.csv.

Load Application train


Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

One Click Setup | Imports | Load datasets

The cells below are redundant; they are included so that all datasets can be reloaded quickly in events like a kernel failure.

Exploratory Data Analysis

Summary of Application train

Missing data for application train

Distribution of the target column

Correlation with the target column

Applicants Age

Applicants years of employments

Interesting observation: DAYS_EMPLOYED: some rows have the value 365243 (equivalent to roughly 1000 years), i.e. some people appear to have been employed for 1000 years. This is almost certainly a sentinel value for missing data rather than a real duration.
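One way this sentinel could be handled is sketched below on a toy frame standing in for application_train; the column name comes from the dataset, while the choice to flag the anomaly and replace it with NaN (so a later imputer can handle it) is ours.

```python
import numpy as np
import pandas as pd

# Toy stand-in for application_train; real values are negative day counts
app = pd.DataFrame({"DAYS_EMPLOYED": [-2000, -350, 365243, -1200, 365243]})

# Flag the sentinel in a new boolean column, then replace it with NaN
app["DAYS_EMPLOYED_ANOM"] = app["DAYS_EMPLOYED"] == 365243
app["DAYS_EMPLOYED"] = app["DAYS_EMPLOYED"].replace({365243: np.nan})

print(app["DAYS_EMPLOYED_ANOM"].sum())  # 2 anomalous rows flagged
```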

Applicants occupations

Applicants Income Type

Target vs Gender

Family Status of Loan applicants

Observation: The relationship between TARGET and "EXT_SOURCE_3", "EXT_SOURCE_2", "EXT_SOURCE_1", "DAYS_EMPLOYED" is not linear and monotonic.

Dataset : bureau

Observation: bureau contains records for all SK_ID_CURR values

Bureau: Applicants Days Credit

Applicants Credit history status

Observation: CREDIT_CURRENCY has 4 types, but the data predominantly contains only one type

Applicants with more than 5, 10, or 15 bureau records

Insights on aggregated data

Dataset: bureau_balance

Observation: we can use the STATUS column to better understand an applicant's repayment behaviour per credit

Question: what do the different status codes indicate, and do they have any significance?

Dataset: previous_application

Features from previous_application can be used once the table is grouped by "SK_ID_CURR"

Design question: how to handle the situation where no previous_application record exists for a given SK_ID_CURR

Dataset: credit_card_balance

Selecting random SK_ID_PREV to gain insights

Insights on aggregated data

Dataset: POS_CASH_BALANCE

Observing a random SK_ID_PREV to gain insights

Insights on aggregated data

Dataset : installments_payment

Observing a random SK_ID_PREV to gain insights

Aggregating data


Dataset questions

Unique record for each SK_ID_CURR

Previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)

Design Decisions | Sample Examples | Sample Feature Engineering

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will produce many new features about each loan application; these features will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files as the secondary tables (e.g., the previous_application dataset). Most secondary tables can be joined to the primary table via the key SK_ID_CURR; tables at the previous-loan level (POS_CASH_balance, credit_card_balance, installments_payments) are additionally keyed by SK_ID_PREV, and bureau_balance is keyed by SK_ID_BUREAU.

Let's assume we wish to generate a feature based on previous application attempts. Possible features here could be, for example, the number of previous applications per client or the mean requested credit amount across those applications.

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, thereby generating many new (derived) features, and then join (merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) before processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition because the aggregations depend only on the secondary tables and can be computed once up front, while split-dependent preprocessing such as imputation and scaling still happens inside the pipeline, avoiding leakage.]


Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main application table (labeled and unlabeled):
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
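The roadmap above can be sketched with toy stand-ins for application_train and previous_application; the aggregate names (PREV_count, PREV_mean) and toy values are ours, but the group-aggregate-merge pattern is the one described in this section.

```python
import pandas as pd

# Toy stand-ins for the primary and a secondary table
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 2],
                     "AMT_CREDIT": [1000.0, 3000.0, 500.0]})

# Aggregate the secondary table down to one row per SK_ID_CURR
prev_agg = (prev.groupby("SK_ID_CURR")["AMT_CREDIT"]
                .agg(["count", "mean"])
                .add_prefix("PREV_")
                .reset_index())

# Left-join onto the primary table; clients with no history get NaN,
# which we fill with 0 for the count feature
app = app.merge(prev_agg, on="SK_ID_CURR", how="left")
app["PREV_count"] = app["PREV_count"].fillna(0)
```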

Pandas DataFrame aggregation detour

Aggregate using one or more operations over the specified axis.

For more details see agg

DataFrame.agg(func=None, axis=0, *args, **kwargs)
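A few minimal illustrations of agg on a toy frame (the column names and values are ours):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10.0, 20.0, 30.0, 40.0]})

# One operation over the default axis (down the rows): returns a Series
df.agg("sum")                          # A -> 10, B -> 100.0

# Several operations at once: returns a DataFrame, one row per function
df.agg(["min", "max", "mean"])

# Different operations per column, via a dict
df.agg({"A": "max", "B": ["mean", "sum"]})
```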

Multiple condition expressions in Pandas

So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using logical operators.

Although Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas; you must use &, |, and ~ instead. The details of why are explained here.

You must use the following operators with pandas:

* & in place of and
* | in place of or
* ~ in place of not
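A small illustration on a toy frame (column names borrowed from the application data, values made up). Note that each condition needs its own parentheses, because & and | bind more tightly than comparisons:

```python
import pandas as pd

apps = pd.DataFrame({"AMT_INCOME_TOTAL": [100000, 250000, 50000],
                     "CODE_GENDER": ["F", "M", "F"]})

# & combines two conditions; each must be parenthesized
mask = (apps["AMT_INCOME_TOTAL"] > 80000) & (apps["CODE_GENDER"] == "F")
subset = apps[mask]

# ~ negates a condition
not_female = apps[~(apps["CODE_GENDER"] == "F")]
```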

Sample Feature Engineering | previous_applications

Missing value analysis

Sample Feature Engineering | previous_application analysis

Sample Feature Engineering | Using feature transformer...

Join the labeled dataset

Test Data Feature Engineering

Join the unlabeled dataset (i.e., the submission file)

Convert categorical features to numerical approximations (via pipeline)

Sample Processing pipeline | Known Issues

OHE when previously unseen unique values in the test/validation set

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case to see how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
               'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in the OHE, which ignores values from the
# validation/test set that do NOT occur in the training set.
# DataFrameSelector is a custom transformer defined elsewhere in this notebook.
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
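The effect of handle_unknown="ignore" can be seen in isolation on a tiny toy example (the category values here are made up; "XNA" stands in for any value absent from the training set):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["F"], ["M"], ["F"]])
test = np.array([["XNA"]])            # category never seen during fit

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The unseen value is encoded as an all-zero row instead of raising an error
row = ohe.transform(test).toarray()
print(row)  # [[0. 0.]]
```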

OHE case study: The breast cancer wisconsin dataset (classification)

Please see this blog for more details on OHE when the validation/test set has previously unseen unique values.

HCDR preprocessing

ABSTRACT

Feature Engineering | Feature Selections | Preprocessing

Train, Valid, Test dataset selection

Feature Engineering

Testing | Experimental Features

Todos
* Bureau features
* Bureau balance
* Application
* Credit card balance
* Class-based feature transformer
* POS_CASH_BALANCE
* Installments payment

Data Pipeline

Secondary Tables Aggregation

Auxiliary classes

Modeling

Baseline Model

To get a baseline, we use a subset of the features after preprocessing them through the pipeline. The baseline model is a logistic regression classifier.
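The baseline can be sketched as follows. This is a minimal illustration on synthetic data (make_classification with an ~8% positive rate standing in for the preprocessed HCDR features), not the exact notebook pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the preprocessed features
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0)

# Preprocessing and estimator combined in a single pipeline
baseline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

# Score with AUROC, the competition metric
auc = roc_auc_score(y_valid, baseline.predict_proba(X_valid)[:, 1])
```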

Gridsearch with CV
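A grid search over the pipeline can be sketched as below; the parameter grid (the logistic regression regularization strength C) and synthetic data are illustrative assumptions, but the step-name "__" addressing of pipeline parameters is the standard scikit-learn mechanism:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# "clf__C" addresses parameter C of the step named "clf"
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```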

Screenshots of Experimental Analysis

Random forest


Decision Tree Classifier


XGBoost


Resampling
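The report does not name the resampling method used; one common option, minority-class oversampling, can be sketched with sklearn.utils.resample on a toy frame (~8% positives, standing in for the imbalanced TARGET):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for the training data
rng = np.random.default_rng(0)
df = pd.DataFrame({"feat": rng.normal(size=1000),
                   "TARGET": (rng.random(1000) < 0.08).astype(int)})

majority = df[df.TARGET == 0]
minority = df[df.TARGET == 1]

# Oversample the minority class (with replacement) up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced.TARGET.value_counts())
```

Resampling should only ever be applied to the training split, never to the validation or test data, or the evaluation becomes optimistic.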


Random Forest after resampling


Decision tree after resampling


XGBoost after resampling


Phase 3

Multi Layer Perceptron (Classification) using PyTorch

Sigmoid is the main activation function used, with batch normalization and dropout added to mitigate overfitting.

The optimizer used is optim.Adam, since it is more efficient and produced better results than SGD in our experiments. The (criterion) loss function used is BCEWithLogitsLoss, since this is a binary classifier. BCEWithLogitsLoss is preferred over BCELoss because it operates on raw logits (which can be negative) and is more numerically stable.

Training and Evaluation(Validation)

Testing

MLP for Regression

Multi-headed loan system

Tensorboard

Pipeline of MLP

Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. By computing the area under the ROC curve, the curve's information is summarized in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

AUC score

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
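Building the submission file can be sketched as below; the IDs and probabilities here are the format example's dummy values, not real model output:

```python
import pandas as pd

# Hypothetical predicted probabilities for the test-set SK_ID_CURR values
test_ids = [100001, 100005, 100013]
probs = [0.1, 0.9, 0.2]

submission = pd.DataFrame({"SK_ID_CURR": test_ids, "TARGET": probs})
submission.to_csv("submission.csv", index=False)  # header included by default
```

With the Kaggle CLI configured, the file can then be submitted with `kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "message"`.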

Kaggle submission via the command line API


report submission

Click on this link

Model Evaluation

Abstract

Home Credit Default Risk is a project in which we determine the creditworthiness of people who have applied for loans. In previous phases, we completed basic EDA and feature engineering, ran the baseline logistic regression model, and performed hyperparameter tuning for the XGBoost model. In Phase 2, we significantly improved the project: we updated the EDA, implemented robust feature engineering for all dataset files, and carried out experimental analysis of hyperparameter tuning for the Logistic Regression, XGBoost, and Random Forest models. We conducted experiments using both the original imbalanced data and resampled data; after comparison, we found that XGBoost was the best model. For the deep learning PyTorch work, we built two MLP models, one for classification and one for regression. The main goal in this phase has been building a multi-layer perceptron (MLP) model in PyTorch for loan default classification and using TensorBoard to visualize the results of training in real time. Each model has 3 layers. The sigmoid activation function is used for classification and ReLU for regression, along with a cross-entropy loss. We achieved a test accuracy of 92% and a test cross-entropy (CXE) loss of 0.29.

Project Description and Data

Data Description

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. Various statistical and machine learning methods can be used to make these predictions.

Data Files Overview:

There are 7 different sources of data:

  1. application_train/application_test: the main training and testing data, with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid, or 1: the loan was not repaid. The target variable defines whether the client had payment difficulties, meaning a late payment of more than X days on at least one of the first Y installments of the loan. Such cases are marked as 1, while all other cases are marked as 0.
  2. bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  3. bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  4. previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  5. POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  6. credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  7. installments_payment: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Tasks Performed in each phase

Phase 1

EDA and Building a baseline pipeline model

Phase 2

Feature Engineering and Hyperparameter tuning, Feature Selection ensemble methods

Phase 3

Build a multi-layer perceptron (MLP) model in PyTorch for loan default classification. Use TensorBoard to visualize the results of training in real time.

Neural Network

Classification: 3 layers (157 input, 64 hidden, and 1 output); uses the sigmoid activation function. Regression: 3 layers (157 input, 64 hidden, and 1 output); uses the ReLU activation function.
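The classification network can be sketched in PyTorch as follows. This is a minimal illustration, not the exact notebook code: the dropout rate, learning rate, and placement of batch normalization are assumptions; only the 157/64/1 layer sizes, sigmoid activation, Adam optimizer, and BCEWithLogitsLoss come from the text.

```python
import torch
import torch.nn as nn

class LoanDefaultMLP(nn.Module):
    """3-layer MLP for loan default classification (157 -> 64 -> 1).

    Batch normalization and dropout are included to mitigate overfitting.
    The final layer outputs raw logits, which pair with BCEWithLogitsLoss
    (numerically more stable than sigmoid followed by BCELoss).
    """
    def __init__(self, n_features=157, n_hidden=64, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.Sigmoid(),
            nn.Dropout(p_drop),
            nn.Linear(n_hidden, 1),   # raw logits, no final sigmoid
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = LoanDefaultMLP()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data
x = torch.randn(8, 157)
y = torch.randint(0, 2, (8,)).float()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```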

Hyperparameters

The classification and regression hyperparameter settings are shown in the notebook screenshots.

Data Leakage Control

On the secondary datasets, we performed feature engineering and combined the results with application_train and application_test. We chose the top 50 numeric features from application_train based on their correlation with the target variable. The final dataset included all of these numeric features as well as the engineered features. We employed pipelines to prevent data leakage during numeric and categorical feature preparation. For cross-validation, each model was designed as a single pipeline combining a data pipeline and an estimator.

Modelling Pipelines

Pipeline for MLP classification


Pipeline for MLP regression


Experiments and Results

For classification MLP


For Regression MLP


For the multi-headed loan default system


Test CXE + MSE loss = 8.66 for the multi-headed loan default system

Conclusion

The project's goal is to identify those who will be able to pay back their debts. Based on the applicant's prior applications, credit bureau history, payment installments, and other important criteria such as sources of income, number of family members, dependents, and so on, our machine learning model can forecast whether or not the individual should be granted a loan. For target value 1, all of the machine learning models trained on the skewed data performed poorly. As evidenced by the confusion matrices, once the models were retrained on resampled data, predictions for target value 1 improved dramatically. Because we used a smaller selection of data to train the deep learning model, it did not perform as well as the traditional machine learning models. In Phase 3, we built an MLP using PyTorch to determine whether a candidate is eligible for a loan, achieving a test accuracy of 92% with a test loss of 0.29. Additionally, we implemented an MLP model for regression and built a multi-headed loan default system that combines the loss functions of both prior models. Subsequently, we created a pipeline that incorporates all three models.

References

Some of the material in this notebook has been adapted from here.

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools
